AITopics | data batch

data batch

Rewriting History: A Recipe for Interventional Analyses to Study Data Effects on Model Behavior

Nadkarni, Rahul, Elazar, Yanai, Gonen, Hila, Smith, Noah A.

arXiv.org Artificial Intelligence, Oct-17-2025

We present an experimental recipe for studying the relationship between training data and language model (LM) behavior. We outline steps for intervening on data batches (i.e., "rewriting history") and then retraining model checkpoints over that data to test hypotheses relating data to behavior. Our recipe breaks down such an intervention into stages that include selecting evaluation items from a benchmark that measures model behavior, matching relevant documents to those items, and modifying those documents before retraining and measuring the effects. We demonstrate the utility of our recipe through case studies on factual knowledge acquisition in LMs, using both cooccurrence statistics and information retrieval methods to identify documents that might contribute to knowledge learning. Our results supplement past observational analyses that link cooccurrence to model behavior, while demonstrating that extant methods for identifying relevant training documents do not fully explain an LM's ability to correctly answer knowledge questions. Overall, we outline a recipe that researchers can follow to test further hypotheses about how training data affects model behavior. Our code is made publicly available to promote future work.
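
The matching and modification stages described in this abstract can be illustrated with a toy example. This is only a sketch under stated assumptions, not the authors' released code: the function names `matching_documents` and `ablate_batch` are hypothetical, the matching rule is a plain case-insensitive substring cooccurrence check, and the intervention shown (dropping matched documents) is just one possible way of modifying a batch.

```python
from typing import List, Sequence


def matching_documents(documents: Sequence[str], subject: str, obj: str) -> List[int]:
    """Return indices of documents that mention both strings (case-insensitive)."""
    subject, obj = subject.lower(), obj.lower()
    return [
        index
        for index, text in enumerate(documents)
        if subject in text.lower() and obj in text.lower()
    ]


def ablate_batch(documents: Sequence[str], matched: Sequence[int]) -> List[str]:
    """Hypothetical intervention: drop the matched documents from the batch."""
    drop = set(matched)
    return [text for index, text in enumerate(documents) if index not in drop]


# Toy data batch; a real batch would come from the pretraining corpus.
batch = [
    "Paris is the capital of France.",
    "France borders Spain.",
    "The Louvre is in Paris, France.",
    "Berlin is in Germany.",
]

matched = matching_documents(batch, "Paris", "France")
rewritten = ablate_batch(batch, matched)
print(len(matched), len(rewritten))  # prints "2 2"
```

In a full experiment, the rewritten batch would replace the original one before retraining from the corresponding checkpoint, and the benchmark items would then be re-evaluated.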

computational linguistic, large language model, machine learning, (16 more...)

2510.14261

Country:

Europe (1.00)
North America > United States (0.67)
North America > Canada (0.46)
Asia > Middle East (0.46)

Genre:

Research Report > New Finding (1.00)
Research Report > Experimental Study (1.00)

Industry: Education (0.46)

Technology:

Information Technology > Artificial Intelligence > Machine Learning (1.00)
Information Technology > Artificial Intelligence > Natural Language > Information Retrieval (0.86)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.69)

Learning from Future: A Novel Self-Training Framework for Semantic Segmentation

Du, Ye

Neural Information Processing Systems, Oct-2-2025, 20:33:12 GMT

Self-training has shown great potential in semi-supervised learning.

artificial intelligence, machine learning, segmentation, (17 more...)

Country:

Asia > China > Shanghai > Shanghai (0.04)
Asia > China > Zhejiang Province > Hangzhou (0.04)
Asia > China > Hong Kong (0.04)

Genre: Research Report > New Finding (0.46)

Industry: Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.68)

NIRANTAR: Continual Learning with New Languages and Domains on Real-world Speech Data

Javed, Tahir, Bhogale, Kaushal, Khapra, Mitesh M.

arXiv.org Artificial Intelligence, Jul-2-2025

We introduce Nirantar, a comprehensive framework for evaluating continual learning (CL) in multilingual and multi-domain ASR. Designed to reflect real-world CL challenges, Nirantar leverages data collected incrementally across 22 languages and 208 districts in India through natural episodes. This enables evaluation across Language-Incremental (LIL), Domain-Incremental (DIL), and the novel Language-Incremental Domain-Incremental Learning (LIDIL) scenarios. Unlike prior work that relies on simulated episodes, Nirantar presents dynamic, non-uniform language and domain shifts, making it an ideal testbed for CL research. With 3250 hours of human-transcribed speech, including 1720 hours newly introduced in this work, our framework enables systematic benchmarking of CL methods. We evaluate existing approaches and demonstrate that no single method performs consistently well, underscoring the need for more robust CL strategies.

artificial intelligence, machine learning, speech recognition, (14 more...)

2507.00534

Country: Asia > India (0.24)

Genre: Research Report (0.64)

Industry: Media (0.46)

Technology:

Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
Information Technology > Artificial Intelligence > Machine Learning (1.00)

A Two-Stage Data Selection Framework for Data-Efficient Model Training on Edge Devices

Gong, Chen, Xing, Rui, Zheng, Zhenzhe, Wu, Fan

arXiv.org Artificial Intelligence, Jun-11-2025

The demand for machine learning (ML) model training on edge devices is escalating due to data privacy and personalized service needs. However, we observe that current on-device model training is hampered by the under-utilization of on-device data, due to low training throughput, limited storage and diverse data importance. To improve data resource utilization, we propose a two-stage data selection framework, Titan, to select the most important data batch from streaming data for model training with guaranteed efficiency and effectiveness. Specifically, in the first stage, Titan filters out a candidate dataset with potentially high importance in a coarse-grained manner. In the second stage of fine-grained selection, we propose a theoretically optimal data selection strategy to identify the data batch with the highest model performance improvement for the current training round. To further enhance time and resource efficiency, Titan leverages a pipeline to co-execute data selection and model training, and avoids resource conflicts by exploiting idle computing resources. We evaluate Titan on real-world edge devices and three representative edge computing tasks with diverse models and data modalities. Empirical results demonstrate that Titan achieves up to a 43% reduction in training time and a 6.2% increase in final accuracy with minor system overhead, such as data processing delay, memory footprint and energy consumption.
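
The coarse-then-fine structure described in this abstract can be sketched generically. The abstract does not specify Titan's importance measures or its optimal selection strategy, so everything below is an assumption for illustration: `two_stage_select` is a hypothetical helper, and the two scoring functions are placeholders (a cheap score for stage one, a costlier score for stage two).

```python
import heapq
from typing import Callable, Iterable, List, TypeVar

Sample = TypeVar("Sample")


def two_stage_select(
    stream: Iterable[Sample],
    coarse_score: Callable[[Sample], float],
    fine_score: Callable[[Sample], float],
    candidate_size: int,
    batch_size: int,
) -> List[Sample]:
    """Pick a training batch from streaming samples in two stages."""
    # Stage 1 (coarse): keep the candidates with the highest cheap score.
    candidates = heapq.nlargest(candidate_size, stream, key=coarse_score)
    # Stage 2 (fine): rank only the candidates with the costlier score.
    return heapq.nlargest(batch_size, candidates, key=fine_score)


# Toy samples: (id, placeholder cheap score, placeholder costly score).
stream = [
    ("a", 0.9, 0.1),
    ("b", 0.2, 0.9),
    ("c", 0.8, 0.7),
    ("d", 0.7, 0.5),
    ("e", 0.1, 0.3),
]

selected = two_stage_select(
    stream,
    coarse_score=lambda s: s[1],
    fine_score=lambda s: s[2],
    candidate_size=3,
    batch_size=2,
)
print([s[0] for s in selected])  # prints "['c', 'd']"
```

Sample "b" has the highest fine score but never reaches stage two, which shows the trade-off such a pipeline makes: the expensive score is computed for only `candidate_size` samples instead of the whole stream.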

artificial intelligence, data sample, machine learning, (15 more...)

doi: 10.1145/3711896.3736823

2505.16563

Country:

North America > Canada (0.16)
Asia > China (0.15)

Genre: Research Report > New Finding (0.48)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Security & Privacy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Representation & Reasoning > Personal Assistant Systems (0.66)

Distribution Alignment for Fully Test-Time Adaptation with Dynamic Online Data Streams

Wang, Ziqiang, Chi, Zhixiang, Wu, Yanan, Gu, Li, Liu, Zhi, Plataniotis, Konstantinos, Wang, Yang

arXiv.org Artificial Intelligence, Jul-16-2024

Given a model trained on source data, Test-Time Adaptation (TTA) enables adaptation and inference in test data streams with domain shifts from the source. Current methods predominantly optimize the model for each incoming test data batch using self-training loss. While these methods yield commendable results in ideal test data streams, where batches are independently and identically sampled from the target distribution, they falter under more practical test data streams that are not independent and identically distributed (non-i.i.d.). The data batches in a non-i.i.d. stream display prominent label shifts relative to each other. This leads to conflicting optimization objectives among batches during the TTA process. Given the inherent risks of adapting the source model to unpredictable test-time distributions, we reverse the adaptation process and propose a novel Distribution Alignment loss for TTA. This loss guides the distributions of test-time features back towards the source distributions, which ensures compatibility with the well-trained source model and eliminates the pitfalls associated with conflicting optimization objectives. Moreover, we devise a domain shift detection mechanism to extend the success of our proposed TTA method to continual domain shift scenarios. Our extensive experiments validate the logic and efficacy of our method. On six benchmark datasets, we surpass existing methods in non-i.i.d. scenarios and maintain competitive performance under the ideal i.i.d. assumption.
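
The idea of pulling test-time feature statistics back towards source statistics can be illustrated with a simple moment-matching distance. The abstract does not give the form of the Distribution Alignment loss, so the function below is a stand-in assumption rather than the paper's definition: `alignment_loss` is a hypothetical name, and it compares only per-dimension means and variances of a test batch against stored source values.

```python
from statistics import fmean, pvariance
from typing import List, Sequence, Tuple


def feature_stats(batch: Sequence[Sequence[float]]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and population variance of a batch of feature vectors."""
    columns = list(zip(*batch))
    means = [fmean(column) for column in columns]
    variances = [pvariance(column) for column in columns]
    return means, variances


def alignment_loss(
    test_batch: Sequence[Sequence[float]],
    source_means: Sequence[float],
    source_vars: Sequence[float],
) -> float:
    """Squared gap between test-batch moments and source moments (assumed form)."""
    means, variances = feature_stats(test_batch)
    mean_gap = sum((m - s) ** 2 for m, s in zip(means, source_means))
    var_gap = sum((v - s) ** 2 for v, s in zip(variances, source_vars))
    return mean_gap + var_gap


test_batch = [[1.0, 2.0], [3.0, 4.0]]
print(alignment_loss(test_batch, [2.0, 3.0], [1.0, 1.0]))  # aligned batch
print(alignment_loss(test_batch, [0.0, 0.0], [1.0, 1.0]))  # shifted batch
```

The first call returns zero because the batch moments equal the source moments; the second is positive because the batch mean has drifted. In a real TTA setup the features would come from a network layer and the loss would be minimized by gradient descent, which this sketch does not attempt.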

adaptation, data stream, domain adaptation, (17 more...)

2407.12128

Country:

North America > Canada > Ontario > Toronto (0.14)
Asia > Middle East > Jordan (0.04)
Europe > Netherlands > South Holland > Delft (0.04)
(2 more...)

Genre: Research Report (0.82)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (0.46)

Anomaly Detection of Tabular Data Using LLMs

Li, Aodong, Zhao, Yunhan, Qiu, Chen, Kloft, Marius, Smyth, Padhraic, Rudolph, Maja, Mandt, Stephan

arXiv.org Artificial Intelligence, Jun-24-2024

Large language models (LLMs) have shown their potential in long-context understanding and mathematical reasoning. In this paper, we study the problem of using LLMs to detect tabular anomalies and show that pre-trained LLMs are zero-shot batch-level anomaly detectors. That is, without extra distribution-specific model fitting, they can discover hidden outliers in a batch of data, demonstrating their ability to identify low-density data regions. For LLMs that are not well aligned with anomaly detection and frequently output factual errors, we apply simple yet effective data-generating processes to simulate synthetic batch-level anomaly detection datasets and propose an end-to-end fine-tuning strategy to bring out the potential of LLMs in detecting real anomalies. Experiments on a large anomaly detection benchmark (ODDS) show that (i) GPT-4 performs on par with state-of-the-art transductive learning-based anomaly detection methods and (ii) our synthetic dataset and fine-tuning strategy are effective in aligning LLMs to this task.

anomaly detection, detection, llm, (13 more...)

2406.16308

Country: Europe > Germany > Rhineland-Palatinate > Kaiserslautern (0.04)

Genre: Research Report (1.00)

Technology:

Information Technology > Data Science > Data Mining > Anomaly Detection (1.00)
Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)

Layerwise Proximal Replay: A Proximal Point Method for Online Continual Learning

Yoo, Jason, Liu, Yunpeng, Wood, Frank, Pleiss, Geoff

arXiv.org Artificial Intelligence, Feb-14-2024

In online continual learning, a neural network incrementally learns from a non-i.i.d. data stream. Nearly all online continual learning methods employ experience replay to simultaneously prevent catastrophic forgetting and underfitting on past data. Our work demonstrates a limitation of this approach: networks trained with experience replay tend to have unstable optimization trajectories, impeding their overall accuracy. Surprisingly, these instabilities persist even when the replay buffer stores all previous training examples, suggesting that this issue is orthogonal to catastrophic forgetting. We minimize these instabilities through a simple modification of the optimization geometry. Our solution, Layerwise Proximal Replay (LPR), balances learning from new and replay data while only allowing for gradual changes in the hidden activation of past data. We demonstrate that LPR consistently improves replay-based online continual learning methods across multiple problem settings, regardless of the amount of available replay memory.
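
For readers unfamiliar with experience replay, the baseline this abstract builds on, a minimal buffer can be sketched as follows. This is not LPR itself, whose optimizer modification the abstract does not detail. The names `ReplayBuffer` and `mixed_batch` are hypothetical, and reservoir sampling is assumed here as the storage policy because it keeps a uniform sample of the stream in fixed memory.

```python
import random
from typing import Any, List, Sequence


class ReplayBuffer:
    """Fixed-size buffer filled by reservoir sampling over a data stream."""

    def __init__(self, capacity: int, seed: int = 0) -> None:
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # total number of examples offered so far
        self.rng = random.Random(seed)

    def add(self, example: Any) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
            return
        # Replace a stored item with probability capacity / seen.
        slot = self.rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = example

    def sample(self, k: int) -> List[Any]:
        return self.rng.sample(self.items, min(k, len(self.items)))


def mixed_batch(new_examples: Sequence[Any], buffer: ReplayBuffer, replay_size: int) -> List[Any]:
    """Combine incoming examples with replayed ones, then store the new ones."""
    replayed = buffer.sample(replay_size)
    for example in new_examples:
        buffer.add(example)
    return list(new_examples) + replayed
```

Each training step would call `mixed_batch` on the incoming stream batch and take a gradient step on the result, so the network sees past and present data together.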

continual learning, learning, lpr, (12 more...)

2402.09542

Country: North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)

Genre:

Research Report (1.00)
Instructional Material > Online (1.00)

Industry:

Education > Educational Setting > Online (0.47)
Education > Educational Setting > Continuing Education (0.46)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)

Fast kernel methods for Data Quality Monitoring as a goodness-of-fit test

Grosso, Gaia, Lai, Nicolò, Letizia, Marco, Pazzini, Jacopo, Rando, Marco, Rosasco, Lorenzo, Wulzer, Andrea, Zanetti, Marco

arXiv.org Artificial Intelligence, Mar-9-2023

Modern high-energy physics experiments operating at colliders are extremely sophisticated devices consisting of millions of sensors sampled every few nanoseconds, producing an enormous throughput of complex data. Several types of technologies are employed, devoted to identifying and measuring the particles that originated in the collisions; in all cases, the environmental conditions are severe, making the required performances challenging to achieve. Although the various subsystems are designed to offer redundancy, measurements can be undermined by malfunctions of parts of the experiment, either because of critical inefficiencies or because of possibly misinterpreted spurious signals. In addition to supervising the status (powering, electronic configuration, temperature, etc.) of the various hardware components, data from all sources must thus be monitored continuously to assess their quality and to promptly detect any faults, possibly providing indications about their causes. Given the rate of tens of MHz at which data is gathered and the number of sensors to be checked, the monitoring process needs to be as automated as possible: approaches based on Machine Learning (ML) techniques are particularly suited for this task and have started being employed by the experimental collaborations [1-4], complementing more traditional methods [5-9].

artificial intelligence, batch, machine learning, (19 more...)

2303.05413

Country:

Europe > Italy > Liguria > Genoa (0.04)
Europe > Spain > Catalonia > Barcelona Province > Barcelona (0.04)
North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
(2 more...)

Genre: Research Report (1.00)

Technology: Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)

A Continual Learning Framework for Adaptive Defect Classification and Inspection

Sun, Wenbo, Kontar, Raed Al, Jin, Judy, Chang, Tzyy-Shuh

arXiv.org Artificial Intelligence, Mar-16-2022

Recent development of advanced sensing and high-performance computing technologies has enabled the wide adoption of machine vision to automatically inspect products' dimensional quality, supporting efficient process control and reducing manual inspection cost. The process control procedure requires effective data analysis methods to provide reliable inspection results. In this paper, we consider a high-volume manufacturing system that uses machine vision at the quality inspection station for automatic classification of product defects. Here, classification implies both identifying a defect and classifying its corresponding type. As a motivating example, we consider the scenario where batches of three-dimensional (3D) point cloud data are independently collected from a manufacturing process. The 3D point cloud data is obtained by measuring the 3D location of points on the product surface using a 3D scanner. The location measurements can then be used for fast classification of surface defects, and thus provide timely feedback for process control. Figure 1 (right) shows some exemplar surface defects on a wood product and the corresponding 3D point cloud measurements. The 3D point cloud measurements have a set of defining characteristics that should be considered in the development of defect classification techniques.

artificial intelligence, defect type, machine learning, (17 more...)

doi: 10.1080/00224065.2023.2224974

2203.08796

Country: North America > United States > Michigan > Washtenaw County > Ann Arbor (0.04)

Genre: Research Report (0.64)

Industry: Education (0.67)

Technology:

Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)

Towards Efficient Scheduling of Federated Mobile Devices under Computational and Statistical Heterogeneity

Wang, Cong, Yang, Yuanyuan, Zhou, Pengzhan

arXiv.org Machine Learning, Sep-15-2020

Originating from distributed learning, federated learning enables privacy-preserving collaboration on a new abstracted level by sharing the model parameters only. While the current research mainly focuses on optimizing learning algorithms and minimizing communication overhead left by distributed learning, there is still a considerable gap when it comes to the real implementation on mobile devices. In this paper, we start with an empirical experiment to demonstrate that computation heterogeneity is a more pronounced bottleneck than communication on the current generation of battery-powered mobile devices, and that the existing methods are haunted by mobile stragglers. Further, non-identically distributed data across the mobile users makes the selection of participants critical to the accuracy and convergence. To tackle the computational and statistical heterogeneity, we utilize data as a tuning knob and propose two efficient polynomial-time algorithms to schedule different workloads on various mobile devices, when data is identically or non-identically distributed. For identically distributed data, we combine partitioning and linear bottleneck assignment to achieve near-optimal training time without accuracy loss. For non-identically distributed data, we convert it into an average cost minimization problem and propose a greedy algorithm to find a reasonable balance between computation time and accuracy. We also establish an offline profiler to quantify the runtime behavior of different devices, which serves as the input to the scheduling algorithms. We conduct extensive experiments on a mobile testbed with two datasets and up to 20 devices. Compared with the common benchmarks, the proposed algorithms achieve 2-100x speedup epoch-wise, 2-7% accuracy gain and boost the convergence rate by more than 100% on CIFAR10.
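
The linear bottleneck assignment objective mentioned in this abstract asks for a one-to-one assignment of workloads to devices that minimizes the slowest device's time, since the slowest participant determines how long a training round takes. The sketch below only illustrates that objective on a tiny instance: it searches all permutations, which takes factorial time, whereas the paper describes polynomial-time algorithms. The function name and the cost matrix are hypothetical.

```python
from itertools import permutations
from typing import List, Sequence, Tuple


def bottleneck_assignment(cost: Sequence[Sequence[float]]) -> Tuple[List[int], float]:
    """Assign one workload per device so that the maximum cost is minimal.

    cost[d][w] is the (profiled) time for device d to process workload w.
    Returns (assignment, bottleneck), where assignment[d] is the workload
    given to device d. Exhaustive search: suitable for small inputs only.
    """
    n = len(cost)
    best_perm: Tuple[int, ...] = tuple(range(n))
    best_value = float("inf")
    for perm in permutations(range(n)):
        value = max(cost[device][perm[device]] for device in range(n))
        if value < best_value:
            best_perm, best_value = perm, value
    return list(best_perm), best_value


# Hypothetical profiled times (seconds) for 3 devices and 3 data partitions.
times = [
    [4, 9, 3],
    [7, 2, 8],
    [6, 5, 1],
]
assignment, bottleneck = bottleneck_assignment(times)
print(assignment, bottleneck)  # prints "[0, 1, 2] 4"
```

Note that this differs from the classic sum-minimizing assignment problem: the objective here is the maximum entry on the chosen assignment, not the total.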

accuracy, artificial intelligence, machine learning, (21 more...)

2005.12326

Country:

North America > United States > Virginia > Norfolk City County > Norfolk (0.04)
North America > United States > New York > Suffolk County > Stony Brook (0.04)
Asia > China > Shanghai > Shanghai (0.04)
(5 more...)

Genre:

Research Report > New Finding (0.67)
Research Report > Experimental Study (0.45)

Industry: Information Technology > Security & Privacy (1.00)

Technology:

Information Technology > Communications > Mobile (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.68)
Information Technology > Artificial Intelligence > Representation & Reasoning > Optimization (0.68)
